Some housekeeping: In case you don’t have the necessary packages installed, run this script to do so.

### Install packages if necessary
list.of.packages <- c("devtools", "rstudioapi", "knitr", "tidyverse", "data.table", "skimr",  "factoextra", "FactoMineR", "GGally", "VIM", "mice", "reshape2")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
rm(list.of.packages, new.packages)

Load standard packages

### Load standard packages
library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2 etc.
library(magrittr) # For additional piping operators
library(data.table) # Good format to work with large datasets
library(skimr) # Nice descriptives

1 Introduction to Artificial Intelligence (AI), Artificial Neural Networks (ANN), and Deep Learning (DL)

1.1 Learning Goals

In this session, you will:

  • Understand the characteristics of deep learning and be able to delineate it.
  • Get introduced to the history and general intuition of artificial neural networks.
  • Grasp the benefits of deep learning.
  • Understand the basic architecture of deep learning models.
  • Get hands-on experience in the Keras framework.
  • Run your first deep learning models.
  • Figure out why and how these models learn.

1.2 First things first, what is it all about?

In the past few years, artificial intelligence (AI) has been a subject of intense media hype. Machine learning, deep learning, and AI come up in countless articles, often outside of technology-minded publications. We’re promised a future of intelligent chatbots, self-driving cars, and virtual assistants: a future sometimes painted in a grim light and other times as utopian, where human jobs will be scarce and most economic activity will be handled by robots or AI agents. For a future or current practitioner of machine learning, it’s important to be able to recognize the signal in the noise so that you can tell world-changing developments from overhyped press releases.

So, let’s first delineate the space and clarify the terminology.

In fact, we will see that machine learning is just a particular subset of the realm of AI technologies and algorithms. Artificial neural networks, in turn, are just a subset of ML techniques; in fact, we already used them briefly in M1. Deep learning, again, is a subset of ML, referring to a particular type of neural networks.

1.3 Artificial Neural Networks

1.3.1 The general idea

The term neural network is a reference to neurobiology, but although some of the central concepts in deep learning were developed in part by drawing inspiration from our understanding of the brain, deep-learning models are not models of the brain. They share some very basic principles but are highly stylized. It’s like saying a paper airplane is an artificial F-20 fighter jet. Anyhow, let’s look at the general idea:

Loosely inspired by the brain, we can model decision processes as follows:

We take input cells (neurons, in that case), and connect them to some output cell (in neuroscience, via a synapse). The receiving cell’s input is equal to the submitting cell’s output, weighted by the strength of the connection. The cell transforms this input via a non-linear activation function to an output, which it in turn submits to connected cells. The simplest of such toy models is called a perceptron:
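As a minimal sketch of this idea in base R: weighted inputs, a bias, and a step activation. The weights and bias below are hand-picked for illustration (they happen to implement a logical AND on binary inputs), not learned from data.

```r
# A minimal perceptron: weighted inputs, a bias, and a step activation.
# The weights and bias are hand-picked for illustration, not learned.
perceptron <- function(x, w, b) {
  as.numeric(sum(w * x) + b > 0)  # step activation: fire if above threshold
}

w <- c(1, 1)  # connection strengths
b <- -1.5     # bias shifts the activation threshold

perceptron(c(0, 0), w, b)  # 0
perceptron(c(1, 0), w, b)  # 0
perceptron(c(1, 1), w, b)  # 1
```

Learning, which we get to later, would consist of adjusting w and b based on prediction errors instead of picking them by hand.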

Obviously, the flexibility of the functional form a perceptron can model is pretty limited. However, that changes quickly by adding an additional layer and some hidden cells in between, which can be understood as latent variables.

1.3.2 A little illustration:

Before we dive into the deep, let’s start with exploring a shallow neural network. We can do that within the well-known caret workflow, deploying the nnet package.

library(caret)
library(nnet)

We will use the famous iris dataset, where we aim at predicting the species of different iris flowers by their sepal and petal length and width.

library(datasets)
data <- iris
data %>% summary()
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

We do the usual stuff, create a train and test split, and fit a neural net.

index <- createDataPartition(y = data$Species, p = 0.75, list = FALSE)
train <- data[index,] 
test <- data[-index,] 

fit.nnet <- train(Species ~ ., train, 
              method='nnet', 
              trace = FALSE)

fit.nnet
## Neural Network 
## 
## 114 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 114, 114, 114, 114, 114, 114, ... 
## Resampling results across tuning parameters:
## 
##   size  decay   Accuracy   Kappa    
##   1     0.0000  0.8014663  0.7073338
##   1     0.0001  0.7438311  0.6241282
##   1     0.1000  0.9603840  0.9397786
##   3     0.0000  0.9357681  0.9047005
##   3     0.0001  0.9586462  0.9368412
##   3     0.1000  0.9696232  0.9536513
##   5     0.0000  0.9328588  0.8999568
##   5     0.0001  0.9491327  0.9225214
##   5     0.1000  0.9696232  0.9536513
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 3 and decay = 0.1.

We can take a look at the hyperparameter tuning under the hood.

plot(fit.nnet)

The tunable parameters here are the number of hidden units (cells in layer 2) and the weight decay (a regularization penalty that shrinks the weights during training, a measure to counter overfitting). More on that later on.

Let’s see how well it predicts on the test data.

table(predict(fit.nnet, test[,1:4]), test$Species)
##             
##              setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         10         0
##   virginica       0          2        12

Not bad: only two versicolor observations were misclassified as virginica.

We can also plot the structure of the neural network to get a first intuition on what’s happening there.

library(NeuralNetTools)
plotnet(fit.nnet$finalModel, alpha=0.6)

We see the variables enter as an input layer, separated by a latent (hidden) layer from the output. The model iteratively tunes the weights between them in order to best fit input to output. But ok, let’s move on… this was just a warm-up.

1.3.3 A brief history of ANNs and DL

Although the core ideas of neural networks were investigated in toy forms as early as the 1950s, the approach took decades to get going. For a long time, the missing piece was an efficient way to train large neural networks. This changed in the mid-1980s, when multiple people independently rediscovered the backpropagation algorithm - a way to train chains of parametric operations using gradient-descent optimization - and started applying it to neural networks.

The first successful practical application of neural nets came in 1989 from Bell Labs, when Yann LeCun combined the earlier ideas of convolutional neural networks and backpropagation, and applied them to the problem of classifying handwritten digits. The resulting network, dubbed LeNet, was used by the United States Postal Service in the 1990s to automate the reading of ZIP codes on mail envelopes (we will take a practical look at that later).

However, then winter came for research on ANNs. After the initial success, a lack of data and computing power, combined with the inability to extend the range of applications beyond a few simple tasks, led most researchers to abandon this line of inquiry. Around 2010, only a handful of people still working on neural networks started to make important breakthroughs: the groups of Geoffrey Hinton at the University of Toronto, Yoshua Bengio at the University of Montreal, Yann LeCun at New York University, and Dan Ciresan at IDSIA.

In 2011, Dan Ciresan started winning image-classification competitions with GPU-trained deep neural networks - the first practical success of deep learning. Then, Hinton’s group entered the yearly large-scale image-classification challenge ImageNet (notoriously difficult at the time, consisting of classifying ca. 1.4m high-res color images into 1,000 categories; top accuracy then was about 74.3%) with convolutional neural network architectures (more on that later in the module), which have dominated the competition ever since. ConvNets soon became THE algorithm for all computer vision tasks; more generally, they work on all perceptual tasks. In 2015, accuracies of 96.4% meant the formerly daunting ImageNet classification task was considered a solved problem.

At the same time, deep learning has also found applications in many other types of problems, such as NLP, and has largely replaced SVMs and tree models in a wide range of applications. For instance, CERN long used tree-based models to analyze particle data from its Large Hadron Collider (LHC), yet recently switched to Keras-based deep neural networks due to their higher performance and ease of training on large datasets.

1.3.4 The “deep” in deep learning

So, what’s the special thing about deep learning, and why is it deep anyhow? Conceptually, DL is a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations, which is almost exclusively done via a neural network architecture. In contrast, most other ML approaches focus on learning only one or two layers of representations of the data \(\rightarrow\) shallow learning. What do the representations learned by a deep-learning algorithm look like? Let’s examine how a network several layers deep transforms an image of a digit in order to recognize what digit it is.

The network transforms the digit image into representations that are increasingly different from the original image and increasingly informative about the final result. You can think of a deep network as a multistage information-distillation operation, where information goes through successive filters and comes out increasingly purified (that is, useful with regard to some task).

So that’s what deep learning is, technically: a multistage way to learn data representations. It’s a simple idea, but as it turns out, very simple mechanisms, sufficiently scaled, can end up looking like magic.

2 Introduction to Keras

There are quite a bunch of deep learning frameworks around, from the older Caffe and Theano to Google’s TensorFlow and the newer PyTorch. However, during the rest of this course, 95% of our deep learning exercises will be done using Keras. Keras is a deep-learning framework that provides a convenient way to define and train almost any kind of deep-learning model. Keras was initially developed for researchers, with the aim of enabling fast experimentation. It has the following features:

  • User-friendly API which makes it easy to quickly prototype deep learning models.
  • Built-in support for convolutional networks (for computer vision), recurrent networks (for sequence processing), and any combination of both.
  • Supports arbitrary network architectures: multi-input or multi-output models, layer sharing, model sharing, etc., is therefore appropriate for building essentially any deep learning model, from a memory network to a neural Turing machine.
  • Is capable of running on top of multiple back-ends including TensorFlow, CNTK, or Theano.
  • Allows the same code to run on CPU or on GPU, and has strong multi-GPU, distributed storage, and training support (Google Cloud, Spark, HDF5…)
  • Can easily be integrated in AI products (Apple CoreML, TensorFlow Android runtime, R or Python webapp backend such as a Shiny or Flask app)

It is widely adopted in academia and industry (Google, Netflix, Uber, CERN, Yelp, Square etc.), and is also a popular framework on Kaggle, the machine-learning competition website, where almost every recent deep-learning competition has been won using Keras models. While Google’s TensorFlow is even more popular, keep in mind that Keras can use TensorFlow (and other popular DL frameworks) as a backend, and allows for a less cumbersome and more high-level workflow.

So, after all, Keras represents a wonderful high-level starter: fast and easy to implement, and in most cases flexible enough to do whatever you feel like.

Sidenote: the weird name (Keras) means “horn” in Greek, and is a reference to ancient Greek literature. E.g., in the Odyssey, supernatural “dream spirits” are divided between those who deceive men with false visions (arriving to Earth through a gate of ivory) and those who announce a future that will come to pass (arriving through a gate of horn). So, enough history lessons, let’s run our first deep learning model!

3 Our first deep learning model

3.1 Introduction

Well, it’s about time to get serious. We will dive straight in and use a simple deep learning model on the classical MNIST dataset. This is the original data used by Yann LeCun and his team to fit an ANN that identifies handwritten digits for the US postal service. It consists of quite a bunch of samples of handwritten digits together with their correct labels. The handwritten digits here conveniently come as 28x28 greyscale matrices, making them a good starter to warm up. Let’s do that.

3.2 Load our data and get ready

# Load our main tool
library(keras)

# Load our data
mnist <- dataset_mnist()

# separate into train and test
train_images <- mnist$train$x
train_labels <- mnist$train$y
test_images <- mnist$test$x
test_labels <- mnist$test$y

Let’s take a look at the structure.

glimpse(train_images)
##  int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
glimpse(train_labels)
##  int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...

3.3 Define the Keras model

The workflow will be as follows: First, we’ll feed the neural network the training data, train_images and train_labels. The network will then learn to associate images and labels. Finally, we’ll ask the network to produce predictions for test_images, and we’ll verify whether these predictions match the labels from test_labels.

Let’s build the network - again, remember that you aren’t expected to understand everything about this example yet.

network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%
  layer_dense(units = 10, activation = "softmax")

Notice that the layer stacking in R is done via the well-known %>%, in Python with a dot (.). That’s about the main difference between the two implementations.

The core building block of neural networks is the layer, a data-processing module that you can think of as a filter for data. Some data goes in, and it comes out in a more useful form. Specifically, layers extract representations out of the data fed into them - hopefully, representations that are more meaningful for the problem at hand. Most of deep learning consists of chaining together simple layers that will implement a form of progressive data distillation.

Here, our network consists of a sequence of two layers, which are densely connected (layer_dense) neural layers. The second (and last) layer is a 10-way softmax layer, which means it will return an array of 10 probability scores (summing to 1). Each score will be the probability that the current digit image belongs to one of our 10 digit classes. So, we defined a network with overall 1,306 cells, consisting of:

  1. Input layer: 28x28 = 784 cells
  2. Hidden layer: 512 cells
  3. Output layer: 10 cells

To make the network ready for training, we need to pick three more things, as part of the compilation step:

  1. Loss function: How the network will be able to measure its performance on the training data, and thus how it will be able to steer itself in the right direction.
  2. Optimizer: The mechanism through which the network will update itself based on the data it sees and its loss function.
  3. Metrics: Here, we’ll only care about accuracy (the fraction of the images that were correctly classified).

While we are already familiar with defining metrics to optimize, defining an optimizer and loss function is new. We will dig into that later. Notice that the compile() function modifies the network in place.

network %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)

Let’s inspect our final setup:

summary(network)
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## dense_1 (Dense)                  (None, 512)                   401920      
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 10)                    5130        
## ===========================================================================
## Total params: 407,050
## Trainable params: 407,050
## Non-trainable params: 0
## ___________________________________________________________________________

Well, we see that a network of this size has quite a large number of trainable parameters (all edge weights plus one bias per unit, meaning 784x512 + 512 + 512x10 + 10 = 407,050).
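We can verify that count by hand in a couple of lines: each dense layer has (inputs × units) weights plus one bias per unit.

```r
# Parameters of a dense layer: weights plus one bias per unit
dense_params <- function(inputs, units) inputs * units + units

layer1 <- dense_params(28 * 28, 512)  # 784 inputs -> 512 units
layer2 <- dense_params(512, 10)       # 512 inputs -> 10 units
layer1           # 401920
layer2           # 5130
layer1 + layer2  # 407050, matching summary(network)
```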

3.4 Preprocess the data

Before training the model, preprocess the data by reshaping it into the shape the network expects and scaling it so that all values are in the [0, 1] interval. Previously, our training images were stored in an 3d array of shape (60000, 28, 28) of type integer with values in the [0, 255] interval. We transform it into a double array of shape (60000, 28 * 28) with values between 0 and 1.

train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images / 255

test_images <- array_reshape(test_images, c(10000, 28 * 28))
test_images <- test_images / 255

Note that we use the array_reshape() rather than the dim() function to reshape the array. I explain why later, when we talk about tensor reshaping.

Lastly, we also need to categorically encode the labels.

train_labels <- to_categorical(train_labels)
test_labels <- to_categorical(test_labels)

3.5 Run the network

We’re now ready to train the network via Keras’ fit() function.

network %>% fit(x = train_images, 
                y = train_labels, 
                epochs = 5, 
                batch_size = 128)

Two quantities are displayed during training: the loss and accuracy of the network over the training data during the subsequent epochs (new training runs after re-adjusting the weights). Notice that the measures improve in every epoch. We quickly reach an accuracy of about 98.9% on the training data. Now let’s check that the model performs well on the test set, too:

metrics <- network %>% evaluate(test_images, test_labels)
metrics

4 Data representations for neural networks

In the previous example, we started from data stored in multidimensional arrays, also called tensors. In general, most current ML systems use tensors as their basic data structure. Tensors are fundamental to the field, so fundamental that Google’s TensorFlow was named after them. So what’s a tensor?

Tensors are a generalization of vectors and matrices to an arbitrary number of dimensions (note that in the context of tensors, a dimension is often called an axis). In R, vectors are used to create and manipulate 1D tensors, and matrices are used for 2D tensors. For higher-level dimensions, array objects (which support any number of dimensions) are used.

4.1 Key tensor-attributes

A tensor is defined by three key attributes:

  1. Number of axes (rank): For instance, a 3D tensor has three axes, and a matrix has two axes.
  2. Shape: This is an integer vector that describes how many dimensions the tensor has along each axis.
  3. Data type: This is the type of the data contained in the tensor; for instance, a tensor’s type could be integer or double. On rare occasions, you may see a character tensor. But because tensors live in preallocated contiguous memory segments, and strings, being variable-length, would preclude the use of this implementation, they’re rarely used.

To make this more concrete, let’s look back at the data we processed in the MNIST example. Since we already manipulated it, we reload it again in its original shape:

mnist <- dataset_mnist()
train_images <- mnist$train$x
train_labels <- mnist$train$y
test_images <- mnist$test$x
test_labels <- mnist$test$y

Now we display the number of axes of the tensor train_images, and then its shape and datatype:

length(dim(train_images))
## [1] 3
dim(train_images)
## [1] 60000    28    28
typeof(train_images)
## [1] "integer"

So what we have here is a 3D tensor of integers. More precisely, it’s an array of 60,000 matrices of 28 × 28 integers. Each such matrix is a grayscale image, with coefficients between 0 and 255. That’s how one of them looks:

digit <- train_images[5,,]
digit[,7:21] # I crop it a bit, otherwise the columns don't fit on one page
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
##  [1,]    0    0    0    0    0    0    0    0    0     0     0     0     0
##  [2,]    0    0    0    0    0    0    0    0    0     0     0     0     0
##  [3,]    0    0    0    0    0    0    0    0    0     0     0     0     0
##  [4,]    0    0    0    0    0    0    0    0    0     0     0     0     0
##  [5,]    0    0    0    0    0    0    0    0    0     0     0     0     0
##  [6,]    0    0    0    0    0    0    0    0    0     0     0     0     0
##  [7,]    0    0    0    0    0    0    0    0    0     0     0     0     0
##  [8,]    0    0    0    0    0    0   55  148  210   253   253   113    87
##  [9,]    0    0    0    0    0   87  232  252  253   189   210   252   252
## [10,]    0    0    0    4   57  242  252  190   65     5    12   182   252
## [11,]    0    0    0   96  252  252  183   14    0     0    92   252   252
## [12,]    0    0  132  253  252  146   14    0    0     0   215   252   252
## [13,]    0  126  253  247  176    9    0    0    8    78   245   253   129
## [14,]   16  232  252  176    0    0    0   36  201   252   252   169    11
## [15,]   22  252  252   30   22  119  197  241  253   252   251    77     0
## [16,]   16  231  252  253  252  252  252  226  227   252   231     0     0
## [17,]    0   55  235  253  217  138   42   24  192   252   143     0     0
## [18,]    0    0    0    0    0    0    0   62  255   253   109     0     0
## [19,]    0    0    0    0    0    0    0   71  253   252    21     0     0
## [20,]    0    0    0    0    0    0    0    0  253   252    21     0     0
## [21,]    0    0    0    0    0    0    0   71  253   252    21     0     0
## [22,]    0    0    0    0    0    0    0  106  253   252    21     0     0
## [23,]    0    0    0    0    0    0    0   45  255   253    21     0     0
## [24,]    0    0    0    0    0    0    0    0  218   252    56     0     0
## [25,]    0    0    0    0    0    0    0    0   96   252   189    42     0
## [26,]    0    0    0    0    0    0    0    0   14   184   252   170    11
## [27,]    0    0    0    0    0    0    0    0    0    14   147   252    42
## [28,]    0    0    0    0    0    0    0    0    0     0     0     0     0
##       [,14] [,15]
##  [1,]     0     0
##  [2,]     0     0
##  [3,]     0     0
##  [4,]     0     0
##  [5,]     0     0
##  [6,]     0     0
##  [7,]     0     0
##  [8,]   148    55
##  [9,]   253   168
## [10,]   253   116
## [11,]   225    21
## [12,]    79     0
## [13,]     0     0
## [14,]     0     0
## [15,]     0     0
## [16,]     0     0
## [17,]     0     0
## [18,]     0     0
## [19,]     0     0
## [20,]     0     0
## [21,]     0     0
## [22,]     0     0
## [23,]     0     0
## [24,]     0     0
## [25,]     0     0
## [26,]     0     0
## [27,]     0     0
## [28,]     0     0

To make it more tangible, let’s plot one:

plot(as.raster(digit, max = 255))

4.2 Tensors and dimensionality

4.2.1 Scalars (0D tensors)

A tensor that contains only one number is called a scalar (or scalar tensor, or zero-dimensional tensor, or 0D tensor). R doesn’t have a data type to represent scalars (all numeric objects are vectors, matrices, or arrays), but an R vector of length 1 is conceptually similar to a scalar.

4.2.2 Vectors (1D tensors)

A one-dimensional array of numbers is called a vector, or 1D tensor. A 1D tensor is said to have exactly one axis. We can convert the R vector to an array object to inspect its dimensions:

x <- c(12, 3, 6, 14, 10)
str(x)
##  num [1:5] 12 3 6 14 10
dim(as.array(x))
## [1] 5

This vector has five entries and so is called a five-dimensional vector. Don’t confuse a 5D vector with a 5D tensor! A 5D vector has only one axis and has five dimensions along its axis, whereas a 5D tensor has five axes (and may have any number of dimensions along each axis). Dimensionality can denote either the number of entries along a specific axis (as in the case of our 5D vector) or the number of axes in a tensor (such as a 5D tensor), which can be confusing at times. In the latter case, it’s technically more correct to talk about a tensor of rank 5 (the rank of a tensor being the number of axes), but the ambiguous notation 5D tensor is common regardless.

4.2.3 Matrices (2D tensors)

A two-dimensional array of numbers is a matrix, or 2D tensor. A matrix has two axes (often referred to as rows and columns). You can visually interpret a matrix as a rectangular grid of numbers:

x <- matrix(rep(0, 3*5), nrow = 3, ncol = 5)
x
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
dim(x)
## [1] 3 5

4.2.4 Arrays (3D tensors and higher-dimensional tensors)

If you pack such matrices in a new array, you obtain a 3D tensor, which you can visually interpret as a cube of numbers:

x <- array(rep(0, 2*3*2), dim = c(2,3,2))
str(x)
##  num [1:2, 1:3, 1:2] 0 0 0 0 0 0 0 0 0 0 ...
dim(x)
## [1] 2 3 2

By packing 3D tensors in an array, you can create a 4D tensor, and so on. In deep learning, you’ll generally manipulate tensors that are 0D to 4D, although you may go up to 5D if you process video data.

For example, in the case before, we were working with a 3D tensor, where the first axis stacked the different observations (samples) and the last two formed a (greyscale pixel) matrix. 3D tensors are also often used for time series. Whenever time matters in your data (or the notion of sequence order), it makes sense to store it in a 3D tensor with an explicit time axis. Each sample can be encoded as a sequence of vectors (a 2D tensor), and thus a batch of data will be encoded as a 3D tensor.

However, this greyscale raster matrix is somewhat of a special case. Images typically have three dimensions: height, width, and color. A single 2D image therefore still represents a 3D tensor, and a bunch of them together a 4D tensor. For example, a batch of 128 RGB-color images could be stored in a 4D tensor of shape (128, 256, 256, 3).

You might already see it coming… videos represent a time series of images and therefore would be a 5D tensor. Videos can be seen as series of frames, where each frame can be stored in a 3D tensor (height, width, color_depth), their sequence in a 4D tensor (frames, height, width, color_depth), and thus a batch of different videos in a 5D tensor of shape (samples, frames, height, width, color_depth).
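These axis conventions are easy to check with plain R arrays. The sizes below are toy values chosen to keep the arrays small; only the number and order of axes matter.

```r
# Toy tensors illustrating the axis conventions (tiny sizes, values all zero)
images <- array(0, dim = c(8, 32, 32, 3))     # (samples, height, width, channels)
video  <- array(0, dim = c(2, 5, 32, 32, 3))  # (samples, frames, height, width, channels)

length(dim(images))  # 4 -> a batch of images is a 4D tensor
length(dim(video))   # 5 -> a batch of videos is a 5D tensor
```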

4.3 Tensor reshaping

Remember that earlier we did not use the dim<-() but the array_reshape() function to manipulate our input tensors.

train_images <- array_reshape(train_images, c(60000, 28 * 28))
str(train_images)
##  num [1:60000, 1:784] 0 0 0 0 0 0 0 0 0 0 ...
dim(train_images)
## [1] 60000   784

This is so that the data is reinterpreted using row-major semantics (as opposed to R’s default column-major semantics), which is in turn compatible with the way the numerical libraries called by Keras (NumPy, TensorFlow, and so on) interpret array dimensions. You should always use the array_reshape() function when reshaping R arrays that will be passed to Keras.
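The difference between the two conventions is easy to see on a tiny vector in base R: matrix() with byrow = TRUE mimics the row-major filling that array_reshape() uses, while R’s default fills column by column.

```r
v <- 1:6

matrix(v, nrow = 2)                # column-major (R default): fills down columns
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

matrix(v, nrow = 2, byrow = TRUE)  # row-major (what array_reshape() does): fills across rows
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```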

Reshaping a tensor means rearranging its rows and columns to match a target shape. Naturally, the reshaped tensor has the same total number of coefficients as the initial tensor. Let’s do a simple example:

x <- matrix(c(0, 1,
              2, 3,
              4, 5),
            nrow = 3, ncol = 2, byrow = TRUE)
x
##      [,1] [,2]
## [1,]    0    1
## [2,]    2    3
## [3,]    4    5
x <- array_reshape(x, dim = c(6, 1))
x
##      [,1]
## [1,]    0
## [2,]    1
## [3,]    2
## [4,]    3
## [5,]    4
## [6,]    5
x <- array_reshape(x, dim = c(2, 3))
x
##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    3    4    5

A special case of reshaping that’s commonly encountered is transposition. Transposing a matrix means exchanging its rows and its columns, so that x[i,] becomes x[, i]. The t() function can be used to transpose a matrix:

x <- t(x)
x
##      [,1] [,2]
## [1,]    0    3
## [2,]    1    4
## [3,]    2    5

4.4 Geometric interpretation of tensor operations

Because the contents of the tensors manipulated by tensor operations can be interpreted as coordinates of points in some geometric space, all tensor operations have a geometric interpretation. For instance, let’s consider addition. We’ll start with the following vector: A = [0.5, 1.0]. It’s a point in a 2D space, but it can also be understood as a vector leading from the origin to this point.

Let’s consider a new point, B = [1, 0.25], which we will add to the previous one. This is done geometrically by chaining together the vector arrows, with the resulting location being the vector representing the sum of the previous two vectors.

In general, elementary geometric operations such as affine transformations, rotations, scaling, and so on can be expressed as tensor operations. For instance, a rotation of a 2D vector by an angle theta can be achieved via a dot product with a 2 × 2 matrix R = [u, v], where u and v are both vectors of the plane: u = [cos(theta), sin(theta)] and v = [-sin(theta), cos(theta)].
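We can check the rotation formula directly in base R: rotating the unit vector [1, 0] by 90 degrees should land on [0, 1].

```r
theta <- pi / 2  # rotate by 90 degrees
R <- matrix(c(cos(theta),  sin(theta),   # column u = [cos(theta), sin(theta)]
              -sin(theta), cos(theta)),  # column v = [-sin(theta), cos(theta)]
            nrow = 2)

round(R %*% c(1, 0), 10)  # the unit vector [1, 0] rotates onto [0, 1]
```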

Some linear, vector, and matrix algebra knowledge might light up again in your brain, right? Good! While for many out-of-the-box “run-this-model” operations on tabular data you might not need it, it will be necessary in case you need to tinker with and customize your models a bit to squeeze out a bit more accuracy. Anyhow, a bit of a refresher in linear algebra would also support your intuition on what’s going on under the hood of deep learning. We, however, leave it like that for now.

5 Learning in Neural Networks

5.1 The learning problem

So, now we know what the “deep” means, and how such a deep network is structured. However, we still do not know much about the learning. Generally, learning happens in the network by adjusting the weights between the different cells. But how does that happen?

Let’s first take a step back. Every cell gets inputs from the connected cells on lower layers which are activated, where the intensity of the input is scaled by the weight of the connection. Whether the cell activates itself is determined by its activation function, a mathematical transformation of its inputs, where the cell (usually) activates above a certain threshold. This can be done in different ways, e.g., with a rectified linear unit (ReLU) or a sigmoid, which we already know from logistic regression models.
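Both activations are one-liners in base R: ReLU clips negative inputs to zero, while the sigmoid squashes any input into the interval (0, 1).

```r
relu    <- function(x) pmax(x, 0)         # rectified linear unit
sigmoid <- function(x) 1 / (1 + exp(-x))  # logistic squashing function

z <- c(-2, 0, 2)
relu(z)     # 0 0 2
sigmoid(z)  # approx. 0.12 0.50 0.88
```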

Back to our network: Here, each neural layer from our first network example transforms its input data as follows:

output = relu(dot(W, input) + b)

In this expression, W and b are tensors that are attributes of the layer. They’re called the weights or trainable parameters of the layer (the kernel and bias attributes, respectively). These weights contain the information learned by the network from exposure to training data. The activation of every cell in the layer is therefore dependent on the multiplication of the corresponding input and weight tensor (dot(W, input)), plus the bias (b), a constant which influences the tendency to activate.

Initially, these weight matrices are filled with small random values (a step called random initialization). Of course, there is no reason to expect that relu(dot(W, input) + b), when W and b are random, will yield any useful representations. What comes next is to gradually adjust these weights, based on some feedback signal (provided by the loss function discussed later). This gradual adjustment, also called training, is basically the learning that ML is all about. It happens within what’s called a training loop, which works as follows. Repeat these steps in a loop, as long as necessary:

  1. Draw a batch of training samples x and corresponding targets y.
  2. Run the network on x (a step called the forward pass) to obtain predictions y_pred.
  3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
  4. Update all weights of the network in a way that slightly reduces the loss on this batch.

You’ll eventually end up with a network that has a very low loss on its training data: a low mismatch between predictions y_pred and expected targets y. The network has “learned” to map its inputs to correct targets. From afar, it may look like magic, but when you reduce it to elementary steps, it turns out to be simple.

Step 1 sounds easy enough-just I/O code. Steps 2 and 3 are merely the application of a handful of tensor operations, so you could implement these steps purely from what you learned in the previous section. The difficult part is step 4: updating the network’s weights. Given an individual weight coefficient in the network, how can you compute whether the coefficient should be increased or decreased, and by how much?

The currently dominant approach to do so is to take advantage of the fact that all operations used in the network are differentiable, and compute the gradient of the loss with regard to the network’s coefficients. You can then move the coefficients in the opposite direction from the gradient, thus decreasing the loss.

5.2 Refresher: What’s a derivative?

Consider a continuous, smooth function f(x) = y, mapping a real number x to a new real number y. Because the function is continuous, a small change in x can only result in a small change in y. Let’s say you increase x by a small factor epsilon_x: this results in a small epsilon_y change to y:

f(x + epsilon_x) = y + epsilon_y

In addition, because the function is smooth (its curve has no abrupt angles), when epsilon_x is small enough, around a certain point p, it’s possible to approximate f as a linear function of slope a, so that epsilon_y becomes a * epsilon_x:

f(x + epsilon_x) = y + a * epsilon_x

Obviously, this linear approximation is valid only when x is close enough to p. The slope a is called the derivative of f in p.

For every differentiable function f(x), there exists a derivative function f'(x) that maps values of x to the slope of the local linear approximation of f at those points. If you’re trying to update x by a factor epsilon_x in order to minimize f(x), and you know the derivative of f, then your job is done: the derivative completely describes how f(x) evolves as you change x. If you want to reduce the value of f(x), you just need to move x a little in the opposite direction from the derivative. It is also helpful to track points where f'(x) == 0, since they indicate stationary points, and therefore candidate local maxima or minima of f(x).
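The idea of “moving x a little opposite the derivative” can be demonstrated on a toy function. A minimal sketch, assuming f(x) = x^2 with derivative f'(x) = 2x (chosen here only for illustration):

```r
# Minimizing f(x) = x^2 by repeatedly stepping against the derivative f'(x) = 2x.
f_prime <- function(x) 2 * x

x <- 3            # arbitrary starting point
step <- 0.1       # a small step size
for (i in 1:100) {
  x <- x - step * f_prime(x)   # move a little opposite the derivative
}
x                 # now very close to 0, the minimum of f
```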

By the way: the curvature of f(x) is captured by its second derivative, f''(x).

5.3 Derivative of a tensor operation: the gradient

A gradient is the derivative of a tensor operation. Consider an input vector x, a matrix W, a target y, and a loss function (to be explained later) loss. You can use W to compute a target candidate y_pred, and compute the loss, or mismatch, between the target candidate y_pred and the target y:

y_pred = dot(W, x)
loss_value = loss(y_pred, y)

If the data inputs x and y are frozen, then this can be interpreted as a function mapping values of W to loss values:

loss_value = f(W)

5.3.1 Stochastic gradient descent

Since f'(x) == 0 indicates a stationary point, to find the global minimum we “only” need to identify all points where f'(x) == 0 and check which of them has the lowest f(x). Applied to a neural network, that would mean finding analytically the combination of weight values that yields the smallest possible loss - which is intractable for networks with thousands or millions of parameters.

Instead, we use the four-step algorithm outlined earlier: modify the parameters little by little based on the current loss value on a random batch of data. Because you’re dealing with a differentiable function, you can compute its gradient, which gives you an efficient way to implement step 4. If you update the weights in the opposite direction from the gradient, the loss will be a little less every time:

  1. Draw a batch of training samples x and corresponding targets y.
  2. Run the network on x to obtain predictions y_pred.
  3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
  4. Compute the gradient of the loss with regard to the network’s parameters (a backward pass).
  5. Move the parameters a little in the opposite direction from the gradient-for example, W = W - (step * gradient) - thus reducing the loss on the batch a bit.

Easy enough! What we just described is stochastic gradient descent (mini-batch SGD). The term stochastic refers to the fact that each batch of data is drawn at random (stochastic is a scientific synonym of random).

As you can see, intuitively it’s important to pick a reasonable value for the step, which we call the learning rate. If it’s too small, the descent down the curve will take many iterations, and it could get stuck in a local minimum. If step is too large, your updates may end up taking you to completely random locations on the curve.
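The five steps above can be sketched end-to-end on the smallest possible “network”: a single weight. This is a hand-rolled illustration under simplifying assumptions (one weight, noise-free data, plain MSE loss), not how Keras implements SGD:

```r
# A minimal mini-batch SGD sketch: learning a single weight w in y = w * x.
# Purely illustrative -- real networks have many weights and use backpropagation.
set.seed(1)
x <- runif(1000)
y <- 3 * x                       # the "true" weight is 3
w <- 0                           # initialization
lr <- 0.1                        # the learning rate (step size)

for (i in 1:500) {
  idx <- sample(1000, 32)                         # 1. draw a random batch
  y_pred <- w * x[idx]                            # 2. forward pass
  grad <- mean(2 * (y_pred - y[idx]) * x[idx])    # 3.-4. gradient of the MSE loss
  w <- w - lr * grad                              # 5. step opposite the gradient
}
w    # close to the true weight, 3
```

Try rerunning with `lr <- 5`: the updates overshoot and w jumps around instead of settling, which is exactly the too-large-step failure mode described above.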

5.3.2 The Backpropagation algorithm: Chaining derivatives

So, now we are almost there. We now know how to minimize the loss function within a layer. However, the thing about deep learning is… well… that you want to use many layers. How do we move on from here? Indeed, the rise of deep learning had to wait for an efficient way to train multi-layered networks.

Luckily, we now have one, and it’s called backpropagation. And it’s actually pretty simple. While an enormous task in terms of the sheer number of calculations, the math behind it should be accessible to high-school students. Just imagine we have a network of three chained layers, connected by three weight tensors:

f(W1, W2, W3) = a(W1, b(W2, c(W3)))

Calculus tells us that such a chain of functions can be differentiated using the following identity, called the chain rule (ohh, dark memories, right?):

(f(g(x)))' = f'(g(x)) * g'(x)

Applying the chain rule to the computation of the gradient values of a neural network was a simple but brilliant idea. We can now start at the last output layer and propagate the loss via the chain rule step-by-step back over all weights in the layer below, then the one below that, and so on, adjusting the weights accordingly. Backpropagation starts with the final loss value and works backward from the top layers to the bottom layers, applying the chain rule to compute the contribution that each parameter had to the loss value.

Sounds like hell to calculate by hand, right? It would be, but don’t worry: you will not have to.
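You can convince yourself that the chain rule really works with a quick numerical check. The functions f(u) = u^2 and g(x) = sin(x) are arbitrary illustrative choices:

```r
# Numerically sanity-checking the chain rule on f(g(x)), with f(u) = u^2, g(x) = sin(x).
f  <- function(u) u^2
g  <- function(x) sin(x)
fp <- function(u) 2 * u       # f'(u)
gp <- function(x) cos(x)      # g'(x)

x <- 0.7
analytic <- fp(g(x)) * gp(x)                  # the chain rule: f'(g(x)) * g'(x)
h <- 1e-6
numeric  <- (f(g(x + h)) - f(g(x))) / h       # finite-difference approximation

c(analytic, numeric)                          # the two values nearly coincide
```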

6 The parts of a deep learning model

6.1 Overview

As we understand by now, training a neural network revolves around the following objects:

  1. Layers, which are combined into a network (or model)
  2. The input data and corresponding targets
  3. The loss function, which defines the feedback signal used for learning
  4. The optimizer, which determines how learning proceeds

In interaction, it can be illustrated like this: The network, composed of layers that are chained together, maps the input data to predictions. The loss function then compares these predictions to the targets, producing a loss value: a measure of how well the network’s predictions match what was expected. The optimizer uses this loss value to update the network’s weights.

Let’s take a closer look at layers, networks, loss functions, and optimizers.

6.2 Layers: the building blocks of deep learning

A layer is a data-processing module that takes as input one or more tensors and that outputs one or more tensors. Some layers are stateless, but more frequently layers have a state: the layer’s weights, one or several tensors learned with stochastic gradient descent, which together contain the network’s knowledge.

Different layers are appropriate for different tensor formats and different types of data processing. For instance, simple vector data, stored in 2D tensors of shape (samples, features), is often processed by densely connected layers, also called fully connected or dense layers (the layer_dense function in Keras). Sequence data, stored in 3D tensors of shape (samples, timesteps, features), is typically processed by recurrent layers such as layer_lstm. Image data, stored in 4D tensors, is usually processed by 2D convolution layers (layer_conv_2d). All that will be introduced in later sessions.

You can think of layers as the LEGO bricks of deep learning, a metaphor that is made explicit by frameworks like Keras. Building deep-learning models in Keras is done by clipping together compatible layers to form useful data-transformation pipelines. The notion of layer compatibility here refers specifically to the fact that every layer will only accept input tensors of a certain shape and will return output tensors of a certain shape. Consider the following example:

layer <- layer_dense(units = 32, input_shape = c(784))

We’re creating a layer that will only accept as input 2D tensors where the last dimension is 784 (the first dimension, the batch dimension, is unspecified, and thus any value would be accepted). This layer will return a tensor where the last dimension has been transformed to be 32.

Thus this layer can only be connected to a downstream layer that expects 32-dimensional vectors as its input. When using Keras, you don’t have to worry about compatibility, because the layers you add to your models are dynamically built to match the shape of the incoming layer. For instance, suppose you write the following:

model <- keras_model_sequential() %>%
  layer_dense(units = 32, input_shape = c(784)) %>%
  layer_dense(units = 32)

The second layer didn’t receive an input shape argument-instead, it automatically inferred its input shape as being the output shape of the layer that came before.

Picking the right network architecture is more an art than a science; and although there are some best practices and principles you can rely on, only practice can help you become a proper neural-network architect. Indeed, the community has so far developed a multitude of different architectures, each more or less suitable for a certain task and data structure. Again, the most important ones will be introduced in later sessions.

Here, we will limit ourselves to a simple feed-forward network, where every layer is only connected to the following one. For now, there are three key architecture decisions to be made about such a stack of dense layers:

  1. How many layers to use
  2. How many hidden units to choose for each layer
  3. Which activation function to use

6.3 Activation functions

As we already saw, we can define activation functions for every layer. While we seldom switch between different activation functions within the intermediate layers, the one we define for the output layer critically depends on the shape of our desired output data.

A brief reminder: Activation functions transform a cell’s weighted inputs into its output. Without them, the dense layer would consist of two linear operations-a dot product and an addition:

output = dot(W, input) + b

So the layer could only learn linear transformations (affine transformations) of the input data: the hypothesis space of the layer would be the set of all possible linear transformations of the input data. Such a hypothesis space is too restricted and wouldn’t benefit from multiple layers of representations, because a deep stack of linear layers would still implement a linear operation: adding more layers wouldn’t extend the hypothesis space.
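This collapse of stacked linear layers into a single one can be verified in a few lines. A minimal sketch with arbitrary random matrices standing in for two weight tensors:

```r
# Why depth needs non-linearity: two stacked linear layers collapse into one.
set.seed(42)
W1 <- matrix(rnorm(6), nrow = 3)    # first "layer": maps 2 inputs to 3 outputs
W2 <- matrix(rnorm(6), nrow = 2)    # second "layer": maps 3 inputs to 2 outputs
x  <- c(1, -1)

deep    <- W2 %*% (W1 %*% x)        # two chained linear transformations
shallow <- (W2 %*% W1) %*% x        # one single equivalent linear layer
all.equal(deep, shallow)            # TRUE
```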

In order to get access to a much richer hypothesis space that would benefit from deep representations, you need a non-linearity, or activation function.

relu is the most popular activation function in deep learning, but there are many other candidates, which all come with similarly strange names: prelu, elu, and so on. A relu (rectified linear unit) is a function meant to zero out negative values, and commonly used for intermediate layers.

Our output layer, however, should model a binary choice (yes/no classification). For such a model, we would in a 2-class problem commonly use a sigmoid function, which we already know from logistic regression models. It “squashes” arbitrary values into the [0, 1] interval, outputting something that can be interpreted as a probability.

However, since we have a multi-class prediction problem, we choose softmax, which outputs a probability distribution over all classes; the class with the highest probability is the predicted one.
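Softmax is simple enough to write by hand. A minimal sketch with hypothetical scores for three classes:

```r
# Softmax turns raw scores into a probability distribution over classes.
softmax <- function(x) exp(x) / sum(exp(x))

scores <- c(1.2, 0.3, 2.5)    # hypothetical raw outputs for three classes
probs  <- softmax(scores)
sum(probs)                    # 1 -- a proper probability distribution
which.max(probs)              # 3 -- the predicted class
```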

6.4 Loss functions and optimizers: keys to configuring the learning process

Once the network architecture is defined, you still have to choose two more things:

  • Loss function (objective function): The quantity that will be minimized during training. It represents a measure of success for the task at hand.
  • Optimizer: Determines how the network will be updated based on the loss function. It implements a specific variant of stochastic gradient descent (SGD).

Choosing the right objective function for the right problem is extremely important: your network will take any shortcut it can to minimize the loss; so if the objective doesn’t fully correlate with success for the task at hand, your network will end up doing things you may not have wanted. Imagine a stupid, omnipotent AI trained via SGD, with this poorly chosen objective function: “maximizing the average well-being of all humans alive.” To make its job easier, this AI might choose to kill all humans except a few and focus on the well-being of the remaining ones-because average well-being isn’t affected by how many humans are left. That might not be what you intended! Just remember that all neural networks you build will be just as ruthless in lowering their loss function-so choose the objective wisely, or you’ll have to face unintended side effects.

Fortunately, when it comes to common problems such as classification, regression, and sequence prediction, there are simple guidelines you can follow to choose the correct loss.

Take this rule-of-thumb table as a good starter:

  Problem type                               Last-layer activation   Loss function
  Binary classification                      sigmoid                 binary_crossentropy
  Multiclass, single-label classification    softmax                 categorical_crossentropy
  Multiclass, multilabel classification      sigmoid                 binary_crossentropy
  Regression to arbitrary values             (none)                  mse

7 Reviewing our initial example

Let’s go back to the first example and review each piece of it in the light of what we have learned up to now: This was the input data:

mnist <- dataset_mnist()

train_images <- mnist$train$x
train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images / 255

test_images <- mnist$test$x
test_images <- array_reshape(test_images, c(10000, 28 * 28))
test_images <- test_images / 255

Now you understand that the input images are stored in tensors of shape (60000, 784) (training data) and (10000, 784) (test data), respectively.

This was our network:

network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28*28)) %>%
  layer_dense(units = 10, activation = "softmax")

Now you understand that this network consists of a chain of two dense layers, that each layer applies a few simple tensor operations to the input data, and that these operations involve weight tensors. We know that layer_dense() creates fully connected layers, so there is a weight between every element of one layer and every element of the following layer.

Weight tensors, which are attributes of the layers, are where the knowledge of the network persists. We know the first layer has 512 cells, and the final output layer 10 (equal to the number of classes to predict). Finally, we know that every cell also applies a non-linear activation function, such as relu, sigmoid, or softmax.

This was the network-compilation step:

network %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)

Now you understand that `categorical_crossentropy` is the loss function that’s used as a feedback signal for learning the weight tensors, and which the training phase will attempt to minimize. You also know that this reduction of the loss happens via mini-batch stochastic gradient descent. The exact rules governing a specific use of gradient descent are defined by the `rmsprop` optimizer passed as the first argument.

Finally, this was the training loop:

network %>% fit(x = train_images, 
                y = train_labels, 
                epochs = 5, 
                batch_size = 128)

Now you understand what happens when you call fit(): the network will start to iterate on the training data in mini-batches of `128` samples, 5 times over (each iteration over all the training data is called an `epoch`). At each iteration, the network will compute the gradients of the weights with regard to the loss on the batch, and update the weights accordingly. After these 5 epochs, the network will have performed `2,345` gradient updates (469 per epoch), and the loss of the network will be sufficiently low that the network will be capable of classifying handwritten digits with high accuracy.
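Where do those numbers come from? A quick back-of-the-envelope check in R:

```r
# 60,000 training samples split into batches of 128, run for 5 epochs.
batches_per_epoch <- ceiling(60000 / 128)    # 469 (the last batch is smaller)
batches_per_epoch * 5                        # 2345 gradient updates in total
```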

At this point, you already know most of what there is to know about the basics of neural networks.

However, there is still some stuff to come, namely:

  1. How to use architectures other than the simple feed-forward one.
  2. How to fight overfitting
  3. How to specify training routines and parameter grid-search
  4. And some more…

But for that, there will be other sessions to come…

8 Your turn

Well, it’s that point of the session again… Here, some well-known data but new model architectures await you…